Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNet v1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that our MUPPET can be easily extended to tackle the few-shot object detection problem, again achieving state-of-the-art performance on the MS-COCO dataset. The code will be available at https://github.com/sauradip/MUPPET.
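To make the prompting idea concrete, below is a minimal sketch (module names, feature dimensions, and token counts are all assumptions, not the authors' code) of the core step: an adapter maps pooled support-video features into the textual token space of a frozen vision-language model, and the resulting pseudo-tokens are concatenated with the class-name token embeddings to form a multi-modal prompt.

```python
# Hypothetical sketch of a visual semantics tokenizer for multi-modal prompts.
import torch
import torch.nn as nn

class VisualSemanticsTokenizer(nn.Module):
    """Meta-learned adapter that turns pooled video features into pseudo text tokens."""
    def __init__(self, vis_dim=768, txt_dim=512, n_tokens=4):
        super().__init__()
        self.n_tokens = n_tokens
        # Lightweight adapter; the pretrained backbones stay frozen.
        self.adapter = nn.Sequential(
            nn.Linear(vis_dim, txt_dim),
            nn.GELU(),
            nn.Linear(txt_dim, n_tokens * txt_dim),
        )

    def forward(self, support_feats):            # (n_support, vis_dim)
        tokens = self.adapter(support_feats)     # (n_support, n_tokens * txt_dim)
        tokens = tokens.view(support_feats.size(0), self.n_tokens, -1)
        return tokens.mean(dim=0)                # average over shots -> (n_tokens, txt_dim)

# Multi-modal prompt = [visual pseudo-tokens ; class-name token embeddings].
tokenizer = VisualSemanticsTokenizer()
support_feats = torch.randn(5, 768)              # 5-shot support videos (pooled features)
class_name_emb = torch.randn(3, 512)             # embeddings of the class-name tokens
prompt = torch.cat([tokenizer(support_feats), class_name_emb], dim=0)
print(prompt.shape)                              # torch.Size([7, 512])
```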
The recently released Ego4D dataset and benchmark significantly scale up and diversify first-person visual perception data. In Ego4D, the Visual Queries 2D Localization task aims to retrieve the past appearance of an object from a first-person-view recording. This task requires a system to spatially and temporally localize the most recent appearance of a given object query, where the query is registered by a single tight visual crop of the object in a different scene. Our study builds on the three-stage baseline introduced in the Episodic Memory benchmark, which solves the problem by detection and tracking: detecting similar objects in all frames, then running a tracker from the most confident detection. In the VQ2D challenge, we identify two limitations of this baseline. (1) The training configuration involves redundant computation: although the training set contains millions of instances, most of them are repetitive, with only 14.6k unique objects, and repeated gradient computation on the same objects makes training inefficient. (2) The false-positive rate on background frames is high, due to the distribution gap between training and evaluation: during training the model only sees clean, stable, and labeled frames, whereas egocentric videos also contain noisy, blurry, or unlabeled background frames. To this end, we develop a more efficient solution. Specifically, we cut the training loop from about 15 days to less than 24 hours and achieve 0.17% spatial AP, 31% higher than the baseline. Our solution ranks first on the public leaderboard. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr.
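The following is a hedged sketch of the two fixes described above (all data-structure names are made up for illustration): deduplicating training instances so each unique object contributes once per epoch, and mixing in unlabeled background frames as negatives to close the train/eval distribution gap.

```python
# Hypothetical epoch-building routine implementing dedup + background negatives.
import random

def build_epoch(instances, background_frames, neg_ratio=0.5):
    """instances: list of dicts with an 'object_id' key; returns one epoch's samples."""
    seen, unique = set(), []
    random.shuffle(instances)
    for inst in instances:                 # keep one instance per unique object
        if inst["object_id"] not in seen:
            seen.add(inst["object_id"])
            unique.append(inst)
    n_neg = int(len(unique) * neg_ratio)   # add background frames as pure negatives
    negatives = [{"object_id": None, "frame": f} for f in
                 random.sample(background_frames, min(n_neg, len(background_frames)))]
    return unique + negatives

epoch = build_epoch(
    instances=[{"object_id": i % 3, "frame": f"f{i}"} for i in range(10)],
    background_frames=[f"bg{i}" for i in range(20)],
)
print(len(epoch))  # 3 unique positives + 1 background negative = 4
```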
Recently, image-wise implicit neural representations of videos have gained popularity for their promising results and rapid speed compared with conventional pixel-wise implicit representations. However, redundant parameters within the network structure lead to a large model size when scaling up for desirable performance. The key reason for this phenomenon is the coupled formulation of NeRV, which outputs the spatial and temporal information of video frames directly from the frame-index input. In this paper, we propose E-NeRV, which dramatically accelerates NeRV by decomposing the image-wise implicit neural representation into separate spatial and temporal contexts. Guided by this new formulation, our model greatly reduces redundant model parameters while retaining representational ability. We experimentally find that our method can improve performance with fewer parameters, reaching convergence more than $8\times$ faster. Code is available at https://github.com/kyleleey/E-NeRV.
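Below is a minimal sketch of the decoupled formulation (layer sizes and a tiny frame resolution are assumptions, not the released E-NeRV code): the frame index drives only a temporal embedding, a learned spatial grid provides the spatial context, and a small fusion head merges the two, instead of one large network mapping index to full frame.

```python
# Hypothetical decoupled spatial/temporal implicit video representation.
import torch
import torch.nn as nn

class DecoupledNeRV(nn.Module):
    def __init__(self, t_dim=64, s_dim=64, h=8, w=8):
        super().__init__()
        self.temporal = nn.Sequential(nn.Linear(1, t_dim), nn.GELU(), nn.Linear(t_dim, t_dim))
        self.spatial = nn.Parameter(torch.randn(h * w, s_dim))  # learned spatial context grid
        self.fuse = nn.Sequential(nn.Linear(t_dim + s_dim, 128), nn.GELU(), nn.Linear(128, 3))
        self.h, self.w = h, w

    def forward(self, t):                         # t: (batch, 1) normalized frame index
        te = self.temporal(t)                     # temporal context from the index alone
        te = te.unsqueeze(1).expand(-1, self.h * self.w, -1)
        se = self.spatial.unsqueeze(0).expand(t.size(0), -1, -1)
        rgb = self.fuse(torch.cat([te, se], dim=-1))   # (batch, h*w, 3)
        return rgb.view(-1, self.h, self.w, 3)

frames = DecoupledNeRV()(torch.tensor([[0.0], [0.5], [1.0]]))
print(frames.shape)                               # torch.Size([3, 8, 8, 3])
```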
Temporal action detection (TAD) with end-to-end training often suffers from a huge demand for computing resources due to long video durations. In this work, we propose an efficient temporal action detector (ETAD) that can train directly from video frames with extremely low GPU memory consumption. Our main idea is to minimize and balance the heavy computation among features and gradients in each training iteration. We propose to sequentially forward the snippet frames through the video encoder, and back-propagate only a small, necessary portion of the gradients to update the encoder. To further alleviate the computational redundancy in training, we propose to dynamically sample only a small subset of proposals during training. Moreover, various sampling strategies and ratios are studied for both the encoder and detector. ETAD achieves state-of-the-art performance on TAD benchmarks with remarkable efficiency. On ActivityNet-1.3, training ETAD in 18 hours can reach 38.25% average mAP with only 1.3 GB memory consumption per video under end-to-end training. Our code will be publicly released.
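The memory-saving recipe can be sketched as follows (a toy encoder and sampling ratio, not the released ETAD code): snippets are forwarded through the encoder one at a time, and only a randomly sampled fraction retains gradients, so activations for the remaining snippets are never stored.

```python
# Hypothetical sketch: sequential forward with sampled gradient back-propagation.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 128))
detector_head = nn.Linear(128, 1)
snippets = torch.randn(100, 256)       # one long video split into 100 snippet features
grad_ratio = 0.1                       # back-propagate through ~10% of snippets

keep = torch.rand(len(snippets)) < grad_ratio
feats = []
for snip, needs_grad in zip(snippets, keep):
    if needs_grad:
        feats.append(encoder(snip))               # activations kept for backward
    else:
        with torch.no_grad():
            feats.append(encoder(snip))           # forward only, no activation memory
loss = detector_head(torch.stack(feats)).mean()
loss.backward()                        # encoder grads come from the sampled snippets only
```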
In this paper, we share our findings from the effort to build practical machine translation (MT) systems capable of translating more than a thousand languages. We describe results in three research domains: (i) building clean, web-mined datasets for over 1500 languages by leveraging semi-supervised pretraining for language identification and developing data-driven filtering techniques; (ii) developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs of our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary-regression task or a span-extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from predefined answer sets with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced, leveraging graph convolution to capture the dependencies among video moment choices for best-choice selection. Extensive experiments on ActivityNet Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. The code has been made available.
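A minimal sketch of the multi-choice relation idea follows (dimensions and the similarity-based adjacency are assumptions): each candidate moment is treated as a node, and a single graph-convolution step over all choices lets the score of one moment depend on the others.

```python
# Hypothetical graph convolution over candidate-moment choices.
import torch
import torch.nn as nn

class ChoiceRelationGCN(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, choices):                    # (n_choices, dim) fused moment-query feats
        # Adjacency = softmax-normalized choice-choice similarity.
        sim = torch.softmax(choices @ choices.t() / choices.size(-1) ** 0.5, dim=-1)
        relational = torch.relu(self.proj(sim @ choices))     # aggregate related choices
        return self.score(relational + choices).squeeze(-1)   # residual, per-choice score

scores = ChoiceRelationGCN()(torch.randn(16, 128))  # 16 candidate moments
best = scores.argmax()                              # grounded moment = best-scoring choice
```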
Data consistency with the physical forward model is crucial in inverse problems, especially in MR imaging reconstruction. The standard approach is to unroll an iterative algorithm into a neural network with the forward model embedded. However, the forward model frequently changes in clinical practice, so the entanglement of the learning component with the forward model makes the reconstruction hard to generalize. The proposed method is more generalizable across different MR acquisition settings because it separates the forward model from the deep learning component. We propose deep learning-based proximal gradient descent to create a learned regularization term that is independent of the forward model. We applied the one-time-trained regularization term to different MR acquisition settings to validate the proposed method, and compared the reconstructions with the commonly used $\ell_1$ regularization. We showed an improvement of roughly 3 dB in peak signal-to-noise ratio compared with conventional $\ell_1$-regularized reconstruction. We demonstrated the flexibility of the proposed method in choosing different undersampling patterns, and also evaluated the effect of parameter tuning on the deep learning regularization.
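Here is a hedged sketch of learned proximal gradient descent on a toy single-coil Cartesian model (the untrained CNN stands in for the learned regularization term, which in the paper is trained once and then reused across forward models): each iteration alternates a data-consistency gradient step with a learned proximal step.

```python
# Hypothetical learned-PGD reconstruction loop for undersampled MRI.
import torch
import torch.nn as nn

def A(x, mask):          # forward model: 2D FFT then k-space undersampling
    return mask * torch.fft.fft2(x)

def At(k, mask):         # adjoint: zero-filled inverse FFT
    return torch.fft.ifft2(mask * k)

prox = nn.Sequential(    # learned proximal operator (real/imag as 2 channels)
    nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 2, 3, padding=1)
)

def reconstruct(y, mask, n_iters=10, step=0.5):
    x = At(y, mask)                                   # zero-filled initialization
    for _ in range(n_iters):
        x = x - step * At(A(x, mask) - y, mask)       # data-consistency gradient step
        z = torch.stack([x.real, x.imag], dim=0).unsqueeze(0)
        z = z + prox(z)                               # residual learned prox step
        x = torch.complex(z[0, 0], z[0, 1])
    return x

mask = (torch.rand(64, 64) < 0.3).float()             # 30% random k-space sampling
y = A(torch.randn(64, 64, dtype=torch.cfloat), mask)  # measured k-space data
print(reconstruct(y, mask).shape)                     # torch.Size([64, 64])
```

Note how swapping `A`/`At` (e.g., for a different undersampling pattern) leaves `prox` untouched, which is the point of decoupling the regularization from the forward model.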
Data valuation, especially quantifying the value of data in algorithmic prediction and decision-making, is a fundamental problem in data trading scenarios. The most widely used method is to define the data Shapley value and approximate it with a permutation sampling algorithm. To address the large estimation variance of permutation sampling, which hinders the development of data marketplaces, we propose a more robust data valuation method based on stratified sampling, named variance-reduced data Shapley (VRDS for short). We theoretically show how to stratify, how many samples to draw from each stratum, and provide a sample complexity analysis of VRDS. Finally, the effectiveness of VRDS is illustrated on different types of datasets and in data removal applications.
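The stratification idea can be sketched as follows (the allocation rule is simplified to an equal number of samples per stratum, whereas the paper derives an optimal allocation): since a point's Shapley value is an average over coalition sizes, each size-stratum is estimated separately and then averaged, reducing variance relative to plain permutation sampling.

```python
# Hypothetical stratified estimator for the data Shapley value of point i.
import random

def stratified_shapley(i, points, utility, samples_per_stratum=50):
    others = [p for p in points if p != i]
    n = len(points)
    total = 0.0
    for k in range(n):                        # stratum = coalition size k
        est = 0.0
        for _ in range(samples_per_stratum):
            S = random.sample(others, k)      # uniform size-k coalition excluding i
            est += utility(S + [i]) - utility(S)
        total += est / samples_per_stratum    # per-stratum marginal contribution
    return total / n                          # average over all strata

# Toy utility: value of a coalition is the (capped) sum of its points.
points = [1, 2, 3, 4]
u = lambda S: min(sum(S), 6)
print(round(stratified_shapley(2, points, u), 3))
```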
Trajectory prediction has been a long-standing problem in intelligent systems such as autonomous driving and robot navigation. Recent state-of-the-art models on large-scale benchmarks have been rapidly pushing the limits of performance, mainly focusing on improving prediction accuracy. However, these models put less emphasis on efficiency, which is critical for real-time applications. This paper proposes an attention-based graph model named GATraj with a much higher prediction speed. The spatial-temporal dynamics of agents, such as pedestrians or vehicles, are modeled by attention mechanisms, and the interactions among agents are modeled by a graph convolutional network. We also implement a Laplacian mixture decoder to mitigate mode collapse and generate diverse multimodal predictions for each agent. Our model achieves performance on par with state-of-the-art models at a much higher prediction speed, evaluated on multiple open datasets.
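Below is a hedged sketch (toy dimensions, not the GATraj release) of a Laplace mixture decoder: each of K modes predicts a trajectory plus a positive scale, and training minimizes the mixture negative log-likelihood so that several modes remain active instead of collapsing to one.

```python
# Hypothetical Laplace mixture decoder and its mixture NLL loss.
import torch
import torch.nn as nn

class LaplaceMixtureDecoder(nn.Module):
    def __init__(self, hid=64, horizon=12, k=5):
        super().__init__()
        self.k, self.horizon = k, horizon
        self.traj = nn.Linear(hid, k * horizon * 2)   # per-mode (x, y) positions
        self.scale = nn.Linear(hid, k * horizon * 2)  # per-mode Laplace scale b > 0
        self.logit = nn.Linear(hid, k)                # mode mixture weights

    def forward(self, h):                             # h: (batch, hid) agent encoding
        mu = self.traj(h).view(-1, self.k, self.horizon, 2)
        b = nn.functional.softplus(self.scale(h)).view(-1, self.k, self.horizon, 2) + 1e-3
        return mu, b, torch.log_softmax(self.logit(h), dim=-1)

def mixture_nll(mu, b, log_pi, target):              # target: (batch, horizon, 2)
    t = target.unsqueeze(1)                           # broadcast over modes
    log_lap = -(torch.abs(t - mu) / b + torch.log(2 * b)).sum(dim=(-1, -2))
    return -torch.logsumexp(log_pi + log_lap, dim=-1).mean()

dec = LaplaceMixtureDecoder()
mu, b, log_pi = dec(torch.randn(4, 64))               # 4 agents
print(mixture_nll(mu, b, log_pi, torch.randn(4, 12, 2)).item())
```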
In hyperspectral image (HSI) classification tasks, textual information, which contains rich prior knowledge about land-cover categories, has largely been ignored, and it is worth exploring how effectively the linguistic modality can assist HSI classification. Moreover, large-scale pretrained image-text foundation models have demonstrated excellent performance in a variety of downstream applications, including zero-shot transfer, yet most domain generalization methods have never addressed mining linguistic-modality knowledge to improve the generalization performance of a model. To compensate for these shortcomings, we propose a Language-aware Domain Generalization Network (LDGNet) that learns cross-domain invariant representations from prior knowledge shared across domains. The proposed method is trained only on the source domain (SD), and the model is then transferred to the target domain (TD). A dual-stream architecture with an image encoder and a text encoder is used to extract visual and linguistic features, where coarse-grained and fine-grained text representations are designed to extract linguistic features at two levels. Furthermore, the linguistic features serve as a cross-domain shared semantic space, and visual-linguistic alignment is accomplished by contrastive learning in this semantic space. Extensive experiments on three datasets demonstrate the superiority of the proposed method over state-of-the-art techniques.
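A minimal sketch of the alignment step follows (embedding sizes and the exact loss form are assumptions): visual features and class-text features are projected into a shared space and pulled together with a CLIP-style symmetric InfoNCE loss.

```python
# Hypothetical contrastive visual-linguistic alignment loss.
import torch
import torch.nn.functional as F

def contrastive_align(vis_feats, txt_feats, temperature=0.07):
    """vis_feats, txt_feats: (batch, dim); row i of each forms a matched pair."""
    v = F.normalize(vis_feats, dim=-1)
    t = F.normalize(txt_feats, dim=-1)
    logits = v @ t.t() / temperature                  # (batch, batch) similarity matrix
    labels = torch.arange(len(v))                     # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, labels) +         # image -> text direction
            F.cross_entropy(logits.t(), labels)) / 2  # text -> image direction

loss = contrastive_align(torch.randn(8, 512, requires_grad=True),
                         torch.randn(8, 512, requires_grad=True))
loss.backward()  # in training this would update the encoders / projection heads
```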